Los Angeles is a city of around 4 million inhabitants [1] from diverse backgrounds (about half of the population is Latino [2]) and had a crime rate of about 635 per 100,000 inhabitants in 2015 [3]. The city hosts the biggest port complex in the US and serves as a hub for illegal drugs entering the country [4]. Fostering this diversity, so as to realize the city’s economic potential and reduce crime, would give children more room to develop, reach high levels of education and move freely around the city.

The aim of this paper is to identify how demographic characteristics at the zip code level in Los Angeles impacted crime rates in 2010. More precisely, the following two hypotheses are formulated and tested here:

For the purpose of the study, three data sets are used:

In the following, the data are explored with their spatial nature in mind, i.e. using bivariate spatial maps, centrographic statistics, OLS models with a spatial error or spatial lag term, and a geographically weighted regression model. First, an exploratory data analysis is performed in order to assess the distribution of the variables of interest, followed by a non-spatial regression analysis.

Descriptive Statistics

The histograms below show that average household size and median age are centered around 3 and 40 respectively. Crime rate has a high number of 1, 2 and 3 values (the underlying data contain no 0 values), which might be due to missing data in certain regions. These low values were kept so as not to dismiss zip code areas arbitrarily in the final analysis; instead, the spatial plots use quantiles, which group all low values together and allow comparison with the higher, more plausible estimates of crime rates.

The interaction between the explanatory variables (median age and average household size) and the dependent variable (number of crimes per zip code) can be seen in the plot below. The natural log of the number of crimes was used in order to reduce skewness, since the original counts are right skewed. At first glance no particular relationship stands out; this is tested by the regressions further below.

Bivariate maps

Number of Crimes vs. Average Household Size

The spatial data deliver more detailed information on the underlying relationships. Under the hypothesis stated in the introduction, one would expect that the redder a circle (i.e. the larger the average household size), the bluer the surrounding area (i.e. the higher the number of crimes). This is however not the case, except for three areas north-west of Santa Monica.

Number of Crimes vs. Median Age

Similarly, no particular relationship between median age and crime rates can be read from the plot.

Centrographic statistics and maps

The following is an attempt to explore the relationships using centroids and standard deviational ellipses (SDE). The independent variables are split into two groups: above (red) vs. below (orange) 40 for median age, and above (red) vs. below (orange) 3 for average household size.

number of crimes VS median age


The plot above shows that younger people are mostly spread across the north and the south of the city, whereas older people spread along a west-east axis. Given that the number of crimes is heterogeneous along either of those axes, it is again hard to draw a conclusion on the effect of age on crime rates. The plot below illustrates the relationship between crimes and household size.

number of crimes VS average household size


Households, whether big or small, seem to be spread evenly across the city, since both SDEs have a similar shape. The following chapter analyzes the spatial data with OLS methods.

Spatial Regression Analysis

The variables average household size and median age are considered in the model, given that they both relate to the hypotheses stated in the introduction. First, an ordinary non-spatial OLS regression is performed, the assumptions for a BLUE estimator are tested and its residuals are mapped; these residuals are then compared with those of models including a spatial lag or spatial error term to control for spatial autocorrelation. Finally, a geographically weighted regression model is fitted on the same data in order to let the coefficients vary across space.

Ordinary Linear Regression

Below are the regression results for the two independent variables (average household size and median age). They are disappointing in terms of explained variance and of the explanatory value of median age. Average household size, however, is significantly different from zero at the 5% significance level: a bigger household implies more crimes, as hypothesized in the introduction. A high R-squared combined with insignificant coefficients can be a sign of multicollinearity; although R-squared is particularly low here, formal tests are still performed below to confirm the absence of multicollinearity.

Dependent variable: log(crimes)
Average.Household.Size   0.675** (0.305)
Median.Age               0.005 (0.028)
Constant                 3.398** (1.441)
Observations             242
R2                       0.020
Adjusted R2              0.012
Residual Std. Error      3.993 (df = 239)
F Statistic              2.457* (df = 2; 239)
Note: *p<0.1; **p<0.05; ***p<0.01
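Because the dependent variable is log(crimes), the household-size coefficient is interpreted multiplicatively. A quick check in R, using the point estimate from the table above:

```r
# In a log-linear model, a one-unit increase in a regressor multiplies the
# expected outcome by exp(beta). With beta = 0.675 for average household size:
beta <- 0.675
exp(beta)              # ~1.96: one extra household member roughly doubles crimes
(exp(beta) - 1) * 100  # ~96% increase, holding median age fixed
```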

Tests for Multicollinearity

Several multicollinearity diagnostics exist. Although the focus here is on the condition number, the table below shows that every other standard test also denies the existence of substantial multicollinearity.

                        result       detection
Determinant             0.9881100    0
Farrar Chi-Square       2.8288247    0
Red Indicator           0.1090411    0
Sum of Lambda Inverse   2.0240660    0
Theil Indicator         0.0036336    0
Condition Number        11.8175111   0
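The condition number is the ratio of the largest to the smallest singular value of the model matrix; values above roughly 30 are commonly read as a warning sign, so 11.8 is unproblematic. As a toy illustration of how collinearity drives it up (simulated data, not the study's):

```r
# Condition number of a model matrix: near-collinear predictors inflate it.
set.seed(1)
x1 <- rnorm(100)
x2_indep  <- rnorm(100)                  # unrelated to x1
x2_collin <- x1 + rnorm(100, sd = 0.01)  # nearly a copy of x1
kappa(cbind(1, x1, x2_indep),  exact = TRUE)  # modest, like the value above
kappa(cbind(1, x1, x2_collin), exact = TRUE)  # orders of magnitude larger
```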

Tests for Heteroskedasticity

A second assumption required for OLS is homoskedasticity of the errors. The Breusch-Pagan test below shows that the null hypothesis of constant variance cannot be rejected at the \(5\%\) significance level.

Breusch-Pagan test
Chi2 0.0650858
Prob > Chi2 0.7986310
DF 1.0000000

Once the classical OLS assumptions are verified, it remains to see whether the spatial element plays a role. Spatial dependence is measured by Moran’s I test for residual spatial autocorrelation and by Lagrange multiplier tests. All tests below are different ways of asking whether a regression with spatial weights would yield a significant spatial lag or spatial error coefficient: for \(y = X \beta + \rho W y + u\), with \(u = \lambda W u + e\) and \(W\) a spatial weights matrix, we test whether \(\rho = 0\) (spatial lag) or whether \(\lambda = 0\) (spatial error); the SARMA test checks for both at once. Given that all p-values are below the \(5\%\) significance level, these null hypotheses can confidently be rejected: the residuals are strongly spatially autocorrelated, so a spatial interaction is present.

                              statistic    p.value     parameter
Moran’s I (error)             6.176374     0.000000    1
Lagrange Multiplier (error)   36.22929     0.00000     1
Robust LM (error)             4.1895123    0.0406748   1
Lagrange Multiplier (lag)     41.95159     0.00000     1
Robust LM (lag)               9.9118129    0.0016422   1
Lagrange Multiplier (SARMA)   46.1411      0.0000      2
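The Moran statistic reported above can be computed directly: for residuals \(e\) and weights \(W\), \(I = \frac{n}{S_0}\frac{e'We}{e'e}\) with \(S_0 = \sum_{ij} w_{ij}\). A minimal sketch on a hypothetical four-zone chain (not the study's data):

```r
# Moran's I for residuals on a toy 4-zone chain with row-standardized
# weights W; positive I means similar residuals cluster in space.
e <- c(2, 1, -1, -2)                       # residuals that cluster spatially
W <- matrix(c( 0,  1,  0,  0,
              .5,  0, .5,  0,
               0, .5,  0, .5,
               0,  0,  1,  0), 4, byrow = TRUE)
n  <- length(e)
S0 <- sum(W)                               # = n for row-standardized W
I  <- (n / S0) * sum(e * (W %*% e)) / sum(e^2)
I                                          # 0.5: positive spatial autocorrelation
```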

Distribution of the Residuals

As a last model check, residuals are plotted below. Some clustering seems to occur in different parts of the city and especially along the coast.

OLS with lag or error term to control for spatial autocorrelation

To deal with the spatial autocorrelation, an OLS model with a spatial lag term and an OLS model with a spatial error term are fitted. Model selection is based on AIC.

Spatial lag model

Spatial lag can indicate a diffusion process: local events predict an increased likelihood of similar events in neighboring places.
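This diffusion mechanism can be sketched by simulating the lag model's data-generating process: solving \(y = \rho W y + X\beta + u\) for \(y\) gives \(y = (I - \rho W)^{-1}(X\beta + u)\), so each area's outcome partly reflects its neighbors'. The numbers below are hypothetical; lagsarlm estimates \(\rho\) from the data instead of fixing it.

```r
# Spatial lag DGP on three areas in a row (hypothetical rho and W).
rho <- 0.6
W   <- matrix(c( 0, 1,  0,
                .5, 0, .5,
                 0, 1,  0), 3, byrow = TRUE)  # row-standardized neighbors
xb  <- c(1, 2, 3)                             # X beta + u, held fixed here
y   <- solve(diag(3) - rho * W, xb)
y                                             # 4 5 6: outcomes pulled together
```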

Dependent variable: log(crimes)
Average.Household.Size   0.411 (0.274)
Median.Age               -0.0001 (0.025)
Constant                 1.692 (1.350)
Observations             242
Log Likelihood           -658.794
sigma2                   12.833
Akaike Inf. Crit.        1,327.587
Wald Test                46.066*** (df = 1)
LR Test                  36.270*** (df = 1)
Note: *p<0.1; **p<0.05; ***p<0.01

No parameter is significantly different from zero in the lag model above, but its AIC (1327.59) is lower than that of the original model (1361.86), indicating a better fit.
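The reported AIC can be cross-checked against the reported log-likelihood via \(AIC = 2k - 2\log L\), with \(k = 5\) estimated parameters here (intercept, two slopes, \(\rho\), \(\sigma^2\)):

```r
# AIC = 2k - 2*logLik for the lag model, from the values in the table above.
logL <- -658.794
k    <- 5
2 * k - 2 * logL   # 1327.588, matching the table up to rounding
```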

Spatial error model

Spatial error implies omitted covariates that are spatially correlated and that affect inference.

Dependent variable: log(crimes)
Average.Household.Size   0.233 (0.271)
Median.Age               -0.003 (0.025)
Constant                 4.936*** (1.300)
Observations             242
Log Likelihood           -660.183
sigma2                   12.986
Akaike Inf. Crit.        1,330.367
Wald Test                44.821*** (df = 1)
LR Test                  33.490*** (df = 1)
Note: *p<0.1; **p<0.05; ***p<0.01

In the spatial error model above, the AIC (1330.37) increases slightly, but the intercept becomes significantly different from zero. Since the intercept carries little inferential information, this is not considered a gain over the lag model; the lag model is preferred due to its lower AIC. Below, the residuals of the lag model and of the error model are plotted; they can be compared with the residuals of the original model mapped further above.

Geographically Weighted Regression Model

Another way of dealing with the intrinsic spatial interactions is to fit a geographically weighted regression model.

The plot above shows where median age influences violent crime across LA. Patches of strongly influenced neighborhoods seem to exist, implying that the influence is not uniform across the city.

The plot above shows where average household size influences violent crime across LA. Similarly, patches of strongly influenced neighborhoods seem to exist, and a clearer pattern is distinguishable: the outskirts of the city are more influenced by average household size. This might be because poorer people live in the outskirts, where a bigger household has a stronger impact on the ability to foster and educate children.
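At each regression point, GWR solves a weighted least squares problem in which observations are down-weighted with distance, typically via a Gaussian kernel \(w_i = \exp(-(d_i/h)^2)\) with bandwidth \(h\) (chosen by cross-validation in gwr.sel above). A minimal sketch of one local fit, with hypothetical distances, bandwidth and data:

```r
# One local GWR fit: Gaussian kernel weights, then weighted least squares.
h <- 1.5
d <- c(0, 1, 2, 3)                  # distances to the regression point
w <- exp(-(d / h)^2)                # nearby observations dominate
x <- c(2, 3, 4, 5)
y <- c(1, 2, 2, 4)
local_fit <- lm(y ~ x, weights = w) # local coefficients at this point
coef(local_fit)
```

Repeating this at every centroid yields the coefficient surfaces mapped above.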

Conclusion

The two hypotheses stated above cannot be confirmed by the statistical analysis. The heterogeneity in the crime data (visible in the plots above) is not well modelled by median age. Average household size bears more inferential power: in the linear regression, a higher average household size was associated with more crimes, and, as implied by the geographically weighted regression, a higher average household size in the suburbs of LA implied more violent crimes. In future studies, other variables available in the crime and zip code datasets could be investigated in order to better model and predict crime rates.

R Script

## Loading and installing packages

knitr::opts_chunk$set(echo = F, message = F, warning = F)

packages <- c("rgdal", "foreign", "gdata", "ggmap", "ggplot2",
              "plyr", "rgeos", "sf", "ggrepel", "dplyr", "sp", "aspace",
              "spdep", "bookdown", "stringr", "maptools", "leaflet", "broom", "stargazer",
              "RColorBrewer", "mctest", "olsrr", "knitr", "spgwr", "htmlwidgets", "htmltools")

package.check <- lapply(packages, FUN = function(x) {
  if (!require(x, character.only = T)) install.packages(x)
  if (! (x %in% (.packages() )))  library(x, character.only = T)
})

# leaflet map titles
tag.map.title <- tags$style(HTML("
  .leaflet-control.map-title { 
    transform: translate(-50%,20%);
    position: fixed !important;
    left: 50%;
    text-align: center;
    padding-left: 10px; 
    padding-right: 10px; 
    background: rgba(255,255,255,0.75);
    font-weight: bold;
    font-size: 28px;
  }
"))



p <- read.csv("../Research Data/2010_Census_Populations_by_Zip_Code.csv")
load("../Research Data/crime.RData")
load(file = "../Research Data/zipCrimes.RData")

names(zc)[2] <- names(p)[1] <- "zip"

zCrimes <- zc
zCrimes@data <- merge(zc@data, p, by = "zip")
# zCrimes <- zCrimes[!is.na(zCrimes$crimes),]

# writeOGR(obj=zCrimes, driver="ESRI Shapefile", "../Research Data/zCrimes")



par(mfrow=c(1,3))
hist(zCrimes$crimes, xlab = "crimes since 2010", main=NULL)
hist(zCrimes$Average.Household.Size, xlab = "average household size in 2010", main=NULL)
hist(zCrimes$Median.Age, xlab = "median age in 2010", main=NULL)
par(mfrow=c(1,1))


par(mfrow=c(1,2))
plot(zCrimes$Median.Age, log(zCrimes$crimes), xlab = "median age", ylab = "ln(crimes)")
plot(zCrimes$Average.Household.Size, log(zCrimes$crimes), xlab = "average household size", ylab = "ln(crimes)")
par(mfrow=c(1,1))


centroids <- as.data.frame(gCentroid(zCrimes,byid=TRUE))

# pal <- colorNumeric(
#   palette = "YlGnBu",
#   domain = zCrimes$crimes
# )

qpal <- colorQuantile("YlGnBu", zCrimes$crimes, n = 5)
qHH <- colorQuantile("YlOrRd", zCrimes$Average.Household.Size, n = 5)

leaflet(zCrimes) %>% addPolygons(weight = 1, smoothFactor = 0.5,
    opacity = 1.0, fillOpacity = 0.5,
    color = ~qpal(crimes),
    highlightOptions = highlightOptions(color = "white", weight = 2,
      bringToFront = TRUE)) %>% 
                addCircles(lng = ~centroids$x, lat = ~centroids$y, weight = 1, color = ~qHH(Average.Household.Size),
                  radius = 800, popup = ~zCrimes$zip, opacity = 0.9, fillOpacity = 0.8 ) %>% 
                addTiles() %>%
    addLegend(pal = qpal, values = ~crimes, opacity = 1) %>%
    addLegend(pal = qHH, values = ~Average.Household.Size, opacity = 1)

qMA <- colorQuantile("YlOrRd", zCrimes$Median.Age, n = 5)


leaflet(zCrimes) %>% addPolygons(weight = 1, smoothFactor = 0.5,
    opacity = 1.0, fillOpacity = 0.5,
    color = ~qpal(crimes),
    highlightOptions = highlightOptions(color = "white", weight = 2,
      bringToFront = TRUE)) %>% 
                addCircles(lng = ~centroids$x, lat = ~centroids$y, weight = 1, color = ~qMA(Median.Age),
                  radius = 800, popup = ~zCrimes$zip, opacity = 0.9, fillOpacity = 0.8 ) %>% 
                addTiles() %>%
    addLegend(pal = qpal, values = ~crimes, opacity = 1) %>%
    addLegend(pal = qMA, values = ~Median.Age, opacity = 1)


zCrimes@data$id = rownames(zCrimes@data)
crimePoints = fortify(zCrimes, region="id")
crimesDf = join(crimePoints, zCrimes@data, by="id")


# c$weapon <- !c$Weapon.Description %in% c("", "STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)", "VERBAL THREAT")
# c$verbal <- c$Weapon.Description %in% "VERBAL THREAT"
# c$fists <- c$Weapon.Description %in% "STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)"
# 
# clean_coords <- gsub(pattern = '[()]', replacement = '', x = c$Location)
# split_coords <- str_split(string = clean_coords, pattern = ', ', n = 2, simplify = T)
# c$lat <- as.numeric(split_coords[,1])
# c$lon <- as.numeric(split_coords[,2])


zCrimesLoc <- cbind(zCrimes@data, centroids)

ggplot(crimesDf) + 
  geom_polygon(aes(long,lat,group=group, fill = crimes)) +
  coord_equal() + 
  stat_ellipse(data = subset(zCrimesLoc, Median.Age > 40), aes(x = x, y = y), level=0.5, color = "red") +
  geom_point(data = subset(zCrimesLoc, Median.Age > 40), aes(x = mean(x), y = mean(y)), color = "red", size = 0.5) +
  stat_ellipse(data = subset(zCrimesLoc, Median.Age <= 40), aes(x = x, y = y), level=0.5, color = "orange") +
  geom_point(data = subset(zCrimesLoc, Median.Age <= 40), aes(x = mean(x), y = mean(y)), color = "orange", size = 0.5) +
  theme_void()


ggplot(crimesDf) + 
  geom_polygon(aes(long,lat,group=group, fill = crimes)) +
  coord_equal() + 
  stat_ellipse(data = subset(zCrimesLoc, Average.Household.Size > 3), aes(x = x, y = y), level=0.5, color = "red") +
  geom_point(data = subset(zCrimesLoc, Average.Household.Size > 3), aes(x = mean(x), y = mean(y)), color = "red", size = 0.5) +
  stat_ellipse(data = subset(zCrimesLoc, Average.Household.Size <= 3), aes(x = x, y = y), level=0.5, color = "orange") +
  geom_point(data = subset(zCrimesLoc, Average.Household.Size <= 3), aes(x = mean(x), y = mean(y)), color = "orange", size = 0.5) +
  theme_void()


regHH <- lm(log(crimes) ~ Average.Household.Size + Median.Age, data = zCrimes)
stargazer(regHH, no.space = T, type = "html")

# Multicollinearity

mc <- omcdiag(zCrimes@data[colnames(zCrimes@data) %in% c("Average.Household.Size", "Median.Age")], log(zCrimes$crimes))

kable(mc$odiags, "html", booktabs = T)


# HETEROSKEDASTICITY

# ols_test_bartlett(as.data.frame(zCrimes@data$"Average.Household.Size"), as.data.frame(zCrimes@data$"Median.Age"))
bp <- ols_test_breusch_pagan(regHH)

m <- matrix(c(bp$bp, bp$p, 1))
rownames(m) <- c("Chi2", "Prob > Chi2", "DF")
kable(m, "html", booktabs = T, row.names = T, caption = "Breusch-Pagan test")


# Spatial Dependence

spatmatrix <- poly2nb(zCrimes)
w <- nb2listw(spatmatrix, zero.policy = T)

# Moran's I
moran <- lm.morantest(regHH, listw = w, zero.policy = T)# Moran’s I test for residual spatial autocorrelation
moranStat <- matrix(c(moran$statistic, moran$p.value, 1))
rownames(moranStat) <- c("statistic", "p.value", "parameter")
colnames(moranStat) <- "Moran's I (error)"

# Lagrange Multiplier test
LM <- lm.LMtests(regHH, listw = w, zero.policy = T, test = c("LMerr", "RLMerr", "LMlag", "RLMlag", "SARMA")) 

LMrows <- c("statistic", "p.value", "parameter")
m <- matrix(NA, nrow = 3, ncol = length(LM))
rownames(m) <- LMrows
colnames(m) <- c("Lagrange Multiplier (error)", "Robust LM (error)", "Lagrange Multiplier (lag)", "Robust LM (lag)", "Lagrange Multiplier (SARMA)")
# colnames(m) <- 
  
# everything in one table
for(i in 1:length(LM)){
  r <- 1
  for(j in LMrows){
    m[r, i] <- LM[[i]][[j]]
    r <- r + 1
  }
}
m <- cbind(moranStat, m)
kable(m, "html", booktabs = T, row.names = T)



# Residuals

#extract residuals
zCrimes@data$lmRes[!is.na(zCrimes$crimes)] <- resid(regHH) #residual lm

#view residuals for linear model
qpal <- colorQuantile("OrRd", zCrimes@data$lmRes, n=9) 

leaflet(zCrimes) %>%
  addPolygons(stroke = FALSE, fillOpacity = .8, smoothFactor = 0.2, color = ~qpal(lmRes)
  ) %>%
  addTiles()


#run a spatial lag model
lagModel <- lagsarlm(log(crimes) ~ Average.Household.Size + Median.Age, data = zCrimes, w,
                  zero.policy = T)
stargazer(lagModel, no.space = T, type = "html")
#note that rho is listed below other model coefs.


#run a spatial error model
errModel <- errorsarlm(log(crimes) ~ Average.Household.Size + Median.Age, data = zCrimes, w, zero.policy = T)
stargazer(errModel, no.space = T, type = "html")




#extract residuals for lag model
zCrimes@data$lagRes[!is.na(zCrimes$crimes)] <- resid(lagModel) #residuals lag

#view residuals for lag model
qpal<-colorQuantile("OrRd", zCrimes@data$lagRes, n=9) 

title <- tags$div(tag.map.title, HTML("spatial lag model residuals"))  

leaflet(zCrimes) %>%
  addPolygons(stroke = FALSE, fillOpacity = .8, smoothFactor = 0.2, color = ~qpal(lagRes)
  ) %>%
  addTiles() %>%
  addControl(title, position = "topleft", className="map-title")


#extract residuals for err model
zCrimes@data$errRes[!is.na(zCrimes$crimes)] <- resid(errModel) #residual err

#compare with residuals for err model
qpal<-colorQuantile("OrRd", zCrimes@data$errRes, n=9) 

title <- tags$div(tag.map.title, HTML("spatial error model residuals"))  

leaflet(zCrimes) %>%
  addPolygons(stroke = FALSE, fillOpacity = .8, smoothFactor = 0.2, color = ~qpal(errRes)
  ) %>%
  addTiles() %>%
  addControl(title, position = "topleft", className="map-title")



centr <- gCentroid(zCrimes, byid = TRUE)  #byid tells function to add centroids per block group

# create SpatialPointsDataFrame to add all data from chi.poly@data to centr@data
centr <- SpatialPointsDataFrame(centr, data= zCrimes@data) 


#Begin GWR prep: First, calculate bandwidth

#calculate optimal bandwidth for model (uses cross validation to find kernel bandwidth that
#generates best model...e.g.-kernel that minimizes RSS)

GWRbandwidth <- gwr.sel(log(crimes) ~ Average.Household.Size + Median.Age, data=centr, verbose = F, 
                        coords=coordinates(centr),adapt=T)# coordinates() extracts x, y coordinates
#adapt tells function to calculate adaptive bandwidth

#run the gwr model
gwr.model <- gwr(log(crimes) ~ Average.Household.Size + Median.Age, 
                 data=subset(centr, !is.na(crimes)), adapt=GWRbandwidth,
                 hatmatrix=TRUE, se.fit=TRUE) # coords are taken from the SpatialPointsDataFrame
#print the results of the model

#extract results for each variable
results<-as.data.frame(gwr.model$SDF)

#attach coefficients to original dataframe
centr$Median.AgeCoef[!is.na(zCrimes$crimes)]  <- results$Median.Age
centr$Average.Household.SizeCoef[!is.na(zCrimes$crimes)]  <- results$Average.Household.Size

#now plot the various GWR coefficients

#Does the median age influence violent crime uniformly across LA?

title <- tags$div(tag.map.title, HTML("median age influence on violent crime"))  

qpal <- colorQuantile("OrRd", centr@data$Median.AgeCoef, n=9) # GWR coefficient, not the raw variable

leaflet(centr) %>%
  addCircleMarkers(radius=3, color = ~qpal(centr@data$Median.AgeCoef)
  ) %>%
  addTiles() %>%
  addControl(title, position = "topleft", className="map-title")



#Does the average household size influence violent crime uniformly across LA?
title <- tags$div(tag.map.title, HTML("average household size influence on violent crime"))  

qpal <- colorQuantile("OrRd", centr@data$Average.Household.SizeCoef, n=9) # GWR coefficient, not the raw variable

leaflet(centr) %>%
  addCircleMarkers(radius=3, color = ~qpal(centr@data$Average.Household.SizeCoef)
  ) %>%
  addTiles() %>%
  addControl(title, position = "topleft", className="map-title")

  1. https://www.census.gov/quickfacts/losangelescitycalifornia

  2. https://www.census.gov/quickfacts/losangelescitycalifornia

  3. https://ucr.fbi.gov/crime-in-the-u.s/2015/crime-in-the-u.s.-2015/offenses-known-to-law-enforcement/violent-crime

  4. https://www.businessinsider.com/dea-maps-of-mexican-cartels-in-the-us-2016-12

  5. https://catalog.data.gov/dataset/crime-data-from-2010-to-present

  6. https://catalog.data.gov/dataset/crime-data-from-2010-to-present

  7. https://www.census.gov/cgi-bin/geo/shapefiles/index.php?year=2018&layergroup=ZIP+Code+Tabulation+Areas